Abstract. We give a complete analysis of the convergence speed of the
batch learning algorithm, and compare it with that of the memoryless
learning algorithm and of learning with full memory (as analyzed in
[KR2001b]). We show that the batch learning algorithm is never worse
than the memoryless learning algorithm (at least asymptotically). Its
performance vis-a-vis learning with full memory is less clear-cut, and
depends on certain probabilistic assumptions.



Introduction 

The original motivation for the work in this paper was provided
by research in learning theory, specifically in various models of
language acquisition (see, for example, [KNN2001], [NKN2001], [KN2001]).
In the paper [KR2001b], we studied the speed of convergence of the
memoryless learner algorithm, and also of learning with full memory.
Since the batch learning algorithm is both widely known and believed
by learning theorists to be faster (at the cost of memory) than both
of the above methods, it seemed natural to analyze its behavior under
the same set of assumptions, in order to bring the analysis in
[KR2001a] and [KR2001b] to a sort of closure. It should be noted
that the detailed analysis of the batch learning algorithm is performed
under the assumption of independence, which was not explicitly present
in our previous work. For the impatient reader we state our main result
(Theorem 6.1) immediately (the reader can compare it with the results
on the memoryless learning algorithm and learning with full memory,
as summarized in Theorem 2.1):



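Theorem A. Let $N_\Delta$ be the number of steps it takes for the student
to have probability $1 - \Delta$ of learning the concept using the batch
learner algorithm. Then we have the following estimates for $N_\Delta$:

• if the distribution of overlaps is uniform, or more generally, the
density function satisfies $f(1 - x) = c + O(x^\delta)$ near $x = 0$, with
$\delta, c > 0$, then there exist positive constants $C_1, C_2$ such that
\[ \lim_{n\to\infty} P\left( C_1 < \frac{N_\Delta}{|\log \Delta|\, n} < C_2 \right) = 1; \]

• if the probability density function $f(1 - x)$ is asymptotic to
$c x^\beta + O(x^{\beta+\delta})$, $\delta, \beta > 0$, as $x$ approaches 0, then
\[ \lim_{n\to\infty} P\left( c_1 < \frac{N_\Delta}{|\log \Delta|\, n^{1/(1+\beta)}} < c_2 \right) = 1 \]
for some positive constants $c_1, c_2$;

• if the asymptotic behavior is as above, but $-1 < \beta < 0$, then
\[ \lim_{x\to\infty} P\left( \frac{1}{x} < \frac{N_\Delta}{|\log \Delta|\, n^{1/(1+\beta)}} < x \right) = 1. \]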

The plan of the paper is as follows: in this Introduction we recall the
learning algorithms we study; in Section 1 we define our mathematical
model; in Section 2 we recall our previous results; in Section 3 we begin
the analysis of the batch learning algorithm, and introduce some of the
necessary mathematical concepts; in Sections 4-6 we analyze the three
cases stated in Theorem A, and we summarize our findings in Section 7.

Memoryless Learning and Learning with Full Memory. The general setup
is as follows: there is a collection of concepts $R_0, \dots, R_n$
and words which refer to these concepts, sometimes ambiguously. The
teacher generates a stream of words, referring to the concept $R_0$. This is
not known to the student, who must learn by guessing, at each step,
some concept $R_i$ and checking it for consistency with the teacher's input.
The memoryless learner algorithm consists of picking a concept $R_i$ at
random, and sticking by this choice until it is proven wrong. At this
point another concept is picked randomly, and the procedure repeats.
Learning with full memory follows the same general process, with the
important difference that once a concept is rejected, the student never
goes back to it. It is clear (for both algorithms) that once the student
hits on the right answer $R_0$, this will be his final answer. We would
like to estimate the probability of having guessed the right answer
after $k$ steps, and also the expected number of steps before the student
settles on the right answer.
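The two procedures just described can be sketched as a small simulation. This is only an illustration under stated assumptions, not part of the analysis: we assume that each word refers to the wrong concept $R_i$ independently with probability `p[i-1]`, that "another concept" means a uniformly random concept different from the current guess, and the function names are hypothetical.

```python
import random

def draw_word(p, rng):
    """Indices i >= 1 of the wrong concepts R_i that the teacher's word also refers to."""
    return {i for i, p_i in enumerate(p, start=1) if rng.random() < p_i}

def memoryless_steps(p, rng):
    """Number of words seen before the memoryless learner first holds R_0."""
    n = len(p)
    guess = rng.randrange(n + 1)            # initial guess among R_0..R_n
    steps = 0
    while guess != 0:
        steps += 1
        if guess not in draw_word(p, rng):  # guess proven wrong by this word
            # pick another concept at random; past rejections are forgotten
            guess = rng.choice([c for c in range(n + 1) if c != guess])
    return steps

def full_memory_steps(p, rng):
    """Same process, but rejected concepts are never revisited."""
    remaining = list(range(len(p) + 1))
    rng.shuffle(remaining)
    guess = remaining.pop()
    steps = 0
    while guess != 0:
        steps += 1
        if guess not in draw_word(p, rng):
            guess = remaining.pop()         # R_0 is still in the list, so pop is safe
    return steps
```

When all overlaps vanish, every wrong guess is rejected at once, so learning with full memory needs at most $n$ steps, while the memoryless learner may need more because it can return to rejected concepts.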
Batch Learning. The batch learning situation is similar to the above,
but here the student records the words $w_1, \dots, w_k, \dots$ he gets from the
teacher. For each word $w_i$, we assume that the student can find (in his
textbook, for example) a list $L_i$ of concepts referred to by the word. If
we define
\[ \mathcal{L}_k = \bigcap_{i=1}^{k} L_i, \]
then we are interested in the smallest value of $k$ such that $\mathcal{L}_k = \{R_0\}$.
This value $k_0$ is the time it has taken the student to learn the concept
$R_0$. We think of $k_0$ as a random variable, and we wish to estimate its
expectation.
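A minimal sketch of the batch learner, under the assumption that each word refers to the wrong concept $R_i$ independently with probability `p[i-1]` (the function name and the `max_steps` safety cap are hypothetical, introduced only for this illustration). The set `alive` plays the role of the intersection of the lists with $R_0$ removed.

```python
import random

def batch_learning_time(p, rng, max_steps=10**6):
    """Smallest k for which the intersection of L_1..L_k is {R_0} (capped at max_steps)."""
    alive = set(range(1, len(p) + 1))   # wrong concepts still consistent with all words
    k = 0
    while alive and k < max_steps:
        k += 1
        # R_0 is in every list; wrong concept i survives word k with probability p[i-1]
        alive = {i for i in alive if rng.random() < p[i - 1]}
    return k
```

With all overlaps equal to zero a single word suffices, and with an overlap equal to one the concept is never learned, which is why the cap is there.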

1. The mathematical model 

We think of the words referring to the concept $R_0$ as a probability
space $\mathcal{P}$. The probability that one of these words also refers to the
concept $R_i$ shall be denoted by $p_i$; the probability that a word refers
to concepts $R_{i_1}, \dots, R_{i_k}$ shall be denoted by $p_{i_1 \dots i_k}$. All the results
described below (obviously) depend in a crucial way on $p_1, \dots, p_n$
and (in the case of the batch learning algorithm) also on the joint
probabilities. Since there is no a priori reason to assume specific values
for the probabilities, we shall assume that all of the $p_i$ are themselves
independent, identically distributed random variables. We shall refer
to their common distribution as $\mathcal{F}$, and to the density as $f$. It turns
out that the convergence properties of the various learning algorithms
depend on the local analytic properties of the distribution $\mathcal{F}$ at 1;
a moment's reflection will convince the reader that this is not really
so surprising.

Sharper analysis of the batch learning algorithm depends on the
independence hypothesis:
\[ p_{i_1 \dots i_k} = p_{i_1} \cdots p_{i_k}. \]

It is again not too surprising that some such assumption on correlations
ought to be required for precise asymptotic results, though it is
obviously the subject of a (non-mathematical) debate as to whether
assuming that the various concepts are truly independent is reasonable
from a cognitive science point of view.
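For experiments it is convenient to have a concrete family of overlap distributions. The sketch below draws the $p_i$ from the density with $f(1 - x) = (1 + \beta) x^\beta$ exactly (so $c = 1 + \beta$ and there is no error term); this particular family, and the function name, are assumptions made for illustration only.

```python
import random

def sample_overlaps(n, beta, rng):
    """Draw p_1..p_n i.i.d. with density f(1 - x) = (1 + beta) * x**beta on [0, 1]."""
    if beta <= -1:
        raise ValueError("beta must exceed -1 for the density to be integrable")
    exponent = 1.0 / (1.0 + beta)
    # Inverse-CDF sampling: q = 1 - p has distribution function x**(1 + beta).
    # 1 - rng.random() lies in (0, 1], so q > 0 and therefore p < 1.
    return [1.0 - (1.0 - rng.random()) ** exponent for _ in range(n)]
```

For $\beta = 0$ this is the uniform distribution; negative $\beta$ pushes mass towards $p = 1$ (heavily overlapping concepts), positive $\beta$ pushes it away.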



2. Previous results 

In previous work [KR2001a] and [KR2001b] we obtained the following
result.

Theorem 2.1. Let $N_\Delta$ be the number of steps it takes for the student
to have probability $1 - \Delta$ of learning the concept. Then we have the
following estimates for $N_\Delta$:

• if the distribution of overlaps is uniform, or more generally, the
density function satisfies $f(1 - x) = c + O(x^\delta)$ near $x = 0$, with
$\delta, c > 0$, then there exist positive constants $C_1, C_2, C_1', C_2'$ such that
\[ \lim_{n\to\infty} P\left( C_1 < \frac{N_\Delta}{|\log \Delta|\, n \log n} < C_2 \right) = 1 \]
for the memoryless algorithm and
\[ \lim_{n\to\infty} P\left( C_1' < \frac{N_\Delta}{(1 - \Delta)^2\, n \log n} < C_2' \right) = 1 \]
when learning with full memory;

• if the probability density function $f(1 - x)$ is asymptotic to
$c x^\beta + O(x^{\beta+\delta})$, $\delta, \beta > 0$, as $x$ approaches 0, then for the two
algorithms we have respectively
\[ \lim_{n\to\infty} P\left( c_1 < \frac{N_\Delta}{|\log \Delta|\, n} < c_2 \right) = 1 \]
and
\[ \lim_{n\to\infty} P\left( c_1' < \frac{N_\Delta}{(1 - \Delta)^2\, n} < c_2' \right) = 1 \]
for some positive constants $c_1, c_2, c_1', c_2'$;

• if the asymptotic behavior is as above, but $-1 < \beta < 0$, then
\[ \lim_{x\to\infty} P\left( \frac{1}{x} < \frac{N_\Delta}{|\log \Delta|\, n^{1/(1+\beta)}} < x \right) = 1 \]
for the memoryless learning algorithm, and similarly
\[ \lim_{x\to\infty} P\left( \frac{1}{x} < \frac{N_\Delta}{(1 - \Delta)^2\, n^{1/(1+\beta)}} < x \right) = 1 \]
for learning with full memory.

Recall that $f(x) = \Theta(g(x))$ means that for sufficiently large $x$, the
ratio $f(x)/g(x)$ is bounded between two strictly positive constants. The
distribution of overlaps referred to above is simply the distribution $\mathcal{F}$.
Notice that the theorem says nothing about the situation when $\mathcal{F}$ is
supported in some interval $[0, a]$, for $a < 1$. That case is (presumably)
of scientific interest, but mathematically it is relatively trivial: we
replace the arguments of all the $\Theta$s above by 1, though, of course, we are
thereby hiding the dependence on $a$.

3. General bounds on the batch learner algorithm

Consider a set of words $w_1, \dots, w_k$. The probability that they all
refer to the concept $R_i$ is, obviously, $p_i^k$.

Lemma 3.1. The probability that we still have not learned the concept
$R_0$ after $k$ steps is bounded above by $\sum_{i=1}^{n} p_i^k$ and below by $\max_i p_i^k$.

Proof. Immediate. □
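The bounds of Lemma 3.1 hold without any independence assumption (the upper bound is the union bound). To see them numerically we need an exact value to compare with, so the sketch below additionally assumes independent concepts, in which case the failure probability after $k$ words is $1 - \prod_i (1 - p_i^k)$. The overlap values are arbitrary illustrative numbers.

```python
import math

def failure_probability(p, k):
    """P(not yet learned after k words) = 1 - prod_i (1 - p_i**k), assuming independence."""
    return 1.0 - math.prod(1.0 - p_i ** k for p_i in p)

p = [0.9, 0.5, 0.2]                      # hypothetical overlaps
for k in (1, 2, 5, 20):
    lower = max(p_i ** k for p_i in p)   # lower bound of Lemma 3.1
    upper = sum(p_i ** k for p_i in p)   # upper bound of Lemma 3.1
    exact = failure_probability(p, k)
    assert lower - 1e-12 <= exact <= upper + 1e-12
```

As $k$ grows both bounds are dominated by the largest overlap, which is why the maximum of the $p_i$ controls the lower bound on the learning time.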

We will first use these upper and lower bounds to get corresponding 
bounds on the convergence speed of the batch learner algorithm, and 
then invoke the independence hypothesis to sharpen these bounds in 
many cases. 

We begin with a trivial but useful lemma. 

Lemma 3.2. Let G be a game where the probability of success (respectively
failure) after at most $k$ steps is $s_k$ (respectively $f_k = 1 - s_k$),
with $s_0 = 0$. Then the expected number of steps until success is
\[ \sum_{k=1}^{\infty} k (s_k - s_{k-1}) = \sum_{k=1}^{\infty} (1 - s_{k-1}) = 1 + \sum_{k=1}^{\infty} f_k, \]
if the corresponding sum converges.

Proof. The proof is immediate from the definition of expectation and
the possibility of rearrangement of terms of positive series. □
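As a sanity check of the tail-sum formula for the expectation, consider a game that succeeds at each step independently with probability $r$ (a hypothetical example, not from the paper), so that $s_k = 1 - (1 - r)^k$ and the expected number of steps is $1/r$. The sketch compares the direct sum $\sum_k k (s_k - s_{k-1})$ with $1 + \sum_{k \ge 1} f_k$, both truncated at $K$ terms.

```python
def expected_steps_direct(s, K):
    """sum_{k=1}^{K} k * (s(k) - s(k-1)): the definition of the expectation, truncated."""
    return sum(k * (s(k) - s(k - 1)) for k in range(1, K + 1))

def expected_steps_tail(s, K):
    """1 + sum_{k=1}^{K} f_k with f_k = 1 - s(k): the tail-sum form, truncated."""
    return 1.0 + sum(1.0 - s(k) for k in range(1, K + 1))

r = 0.25                                   # per-step success probability (illustrative)
s = lambda k: 1.0 - (1.0 - r) ** k         # success within k steps; s(0) = 0
```

Both expressions agree with $1/r = 4$ once $K$ is large enough for the geometric tail to be negligible.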

We can combine Lemma 3.1 and Lemma 3.2 to obtain:

Theorem 3.3. The expected time $T$ of convergence of the batch learner
algorithm is bounded as follows:
\[ \sum_{i=1}^{n} \frac{1}{1 - p_i} \ge T \ge \max_{1 \le i \le n} \frac{1}{1 - p_i}. \tag{1} \]
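The two-sided bound on the expected time can be checked numerically. The sketch below assumes independent concepts, so that the failure probability after $k$ words is $1 - \prod_i (1 - p_i^k)$, and evaluates the expected number of steps as the sum of these failure probabilities over $k \ge 0$; the overlaps are arbitrary illustrative values and the function name is hypothetical.

```python
import math

def expected_batch_time(p, tol=1e-12, max_terms=10**6):
    """Sum over k >= 0 of (1 - prod_i (1 - p_i**k)), assuming independent concepts."""
    total = 0.0
    for k in range(max_terms):
        term = 1.0 - math.prod(1.0 - p_i ** k for p_i in p)
        total += term
        if term < tol:                   # remaining terms decay geometrically
            break
    return total

p = [0.9, 0.5, 0.2]                                 # hypothetical overlaps
T = expected_batch_time(p)
upper = sum(1.0 / (1.0 - p_i) for p_i in p)         # sum of 1/(1 - p_i)
lower = max(1.0 / (1.0 - p_i) for p_i in p)         # max of 1/(1 - p_i)
assert lower <= T <= upper
```

In this example the expected time sits just above the lower bound, because a single concept with overlap 0.9 dominates.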

The leftmost term in equation (1) has been studied at length in
[KR2001a]. We state a version of the results of [KR2001a] below:

Theorem 3.4. Let $S = \sum_{i=1}^{n} \frac{1}{1 - p_i}$, where the $p_i$ are independent,
identically distributed random variables with values in $[0, 1]$, with
probability density $f$ such that $f(1 - x) = c x^\beta + O(x^{\beta+\delta})$, $\delta > 0$,
for $x \to 0$. If $\beta > 0$, then there exists a mean $m$ such that
$\lim_{n\to\infty} P(|S/n - m| > \epsilon) = 0$ for any $\epsilon > 0$. If $\beta = 0$, then
$\lim_{n\to\infty} P(|S/(n \log n) - 1| > \epsilon) = 0$. Finally, if $-1 < \beta < 0$, then
\[ \lim_{n\to\infty} P\left( S / n^{1/(1+\beta)} - C > a \right) = g(a), \]
where $\lim_{a\to\infty} g(a) = 0$ and $C$ is an arbitrary (but fixed) constant,
and likewise
\[ \lim_{n\to\infty} P\left( S / n^{1/(1+\beta)} < b \right) = h(b), \]
where $\lim_{b\to 0} h(b) = 0$.

The right hand side of Eq. (1) is easier to understand. Indeed, let
$p_1, \dots, p_n$ be distributed as usual (and as in the statement of Theorem
3.4). Then

Theorem 3.5.
\[ \lim_{n\to\infty} n^{1/(1+\beta)} E\left( 1 - \max_{1 \le i \le n} p_i \right) = C, \]
for some positive constant $C$.

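Proof. First, we change variables to $q_i = 1 - p_i$. Obviously, the
statement of the Theorem is equivalent to the statement that
\[ E = E\left( \min_{1 \le i \le n} q_i \right) \sim C n^{-1/(1+\beta)}. \]
We also write $h(x) = f(1 - x)$, and let $H$ be the distribution function
whose density is $h$, so that $H(x) = 1 - F(1 - x)$. Now, the probability
that all of the $q_i$ are greater than $t$ equals $(1 - H(t))^n$, so the
minimum has distribution function $1 - (1 - H(t))^n$, and
\[ E = \int_0^1 t \, d\left[ 1 - (1 - H(t))^n \right] = \int_0^1 (1 - H(t))^n \, dt. \]
We change variables $t = u / n^{1/(1+\beta)}$, to obtain
\[ E = n^{-1/(1+\beta)} \int_0^{n^{1/(1+\beta)}} \left[ 1 - H\left( \frac{u}{n^{1/(1+\beta)}} \right) \right]^n du. \tag{2} \]
Let us write $n^{1/(1+\beta)} E = E_1(n) + E_2(n)$, where
\[ E_1(n) = \int_0^{n^{1/(3(\beta+1))}} \left[ 1 - H\left( \frac{u}{n^{1/(1+\beta)}} \right) \right]^n du, \tag{3} \]
\[ E_2(n) = \int_{n^{1/(3(\beta+1))}}^{n^{1/(1+\beta)}} \left[ 1 - H\left( \frac{u}{n^{1/(1+\beta)}} \right) \right]^n du. \tag{4} \]
Recall that
\[ H(x) = \frac{c}{1+\beta} x^{1+\beta} + O(x^{1+\beta+\delta}). \tag{5} \]
Let
\[ I = \int_0^\infty \exp\left( -\frac{c}{1+\beta} x^{1+\beta} \right) dx. \tag{6} \]
We now show:
\[ \lim_{n\to\infty} E_1(n) = I. \tag{7} \]
This is an immediate consequence of Lemma 3.7 and Eq. (5). Also,
\[ \lim_{n\to\infty} E_2(n) = 0. \tag{8} \]
Since $H$ is a monotonically increasing function, it is sufficient to show
that
\[ \lim_{n\to\infty} n^{1/(1+\beta)} \left[ 1 - H\left( \frac{n^{1/(3(\beta+1))}}{n^{1/(1+\beta)}} \right) \right]^n = 0. \]
This is immediate from Eq. (5) and Lemma 3.7. □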
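The limit in Theorem 3.5 can be checked in closed form in one special case. Assume (for illustration only) that $H(t) = t^{1+\beta}$ on $[0, 1]$, that is, $f(1 - x) = (1 + \beta) x^\beta$. Then $E(\min_i q_i) = \int_0^1 (1 - t^{1+\beta})^n \, dt$ is a Beta integral, equal to $\Gamma(a + 1)\Gamma(n + 1)/\Gamma(n + 1 + a)$ with $a = 1/(1+\beta)$, and the scaled expectation should approach $\int_0^\infty \exp(-x^{1+\beta}) \, dx = \Gamma(1 + a)$. The function name is hypothetical.

```python
import math

def scaled_expected_min(n, beta):
    """n**(1/(1+beta)) * E(min q_i) for H(t) = t**(1+beta), via the Beta integral."""
    a = 1.0 / (1.0 + beta)
    # log of Gamma(a + 1) * Gamma(n + 1) / Gamma(n + 1 + a), computed stably
    log_e_min = math.lgamma(a + 1.0) + math.lgamma(n + 1.0) - math.lgamma(n + 1.0 + a)
    return n ** a * math.exp(log_e_min)
```

For $\beta = 1$ the limit is $\Gamma(3/2) = \sqrt{\pi}/2$, and the finite-$n$ values approach it from below with an error of order $1/n$.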



Remark 3.6. The argument shows that $C = I$, where $C$ is the constant
in the statement of Theorem 3.5, and $I$ is the integral introduced in
Eq. (6).

Lemma 3.7. Let $f_n(x) = (1 - x/n)^n$, and let $0 < z < 1/2$. Then, for
$0 \le x \le n^z$,
\[ f_n(x) = \exp(-x) \left( 1 - \frac{x^2}{2n} + o\left( \frac{x^2}{n} \right) \right). \]

Proof. Note that
\[ \log f_n(x) = n \log(1 - x/n) = -x - \sum_{k=2}^{\infty} \frac{x^k}{k\, n^{k-1}}. \]
The assertion of the lemma follows by exponentiating the two sides of
the above equation. □

We need one final observation: 

Theorem 3.8. The variable $n^{1/(1+\beta)} \min_{1 \le i \le n} q_i$ has a limiting
distribution with distribution function
$G(x) = 1 - \exp\left( -\frac{c}{1+\beta} x^{1+\beta} \right)$.

Proof. Immediate from the proof of Theorem 3.5. □
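The convergence to the limiting law can be seen directly in the special case $H(t) = t^{1+\beta}$ (an assumption made for illustration, corresponding to $c = 1 + \beta$), where the survival function of the rescaled minimum is exactly $(1 - x^{1+\beta}/n)^n$ and the limit is $\exp(-x^{1+\beta})$. The function name is hypothetical.

```python
import math

def survival_of_scaled_min(x, n, beta):
    """P(n**(1/(1+beta)) * min q_i > x) when H(t) = t**(1+beta) on [0, 1]."""
    t = min(x / n ** (1.0 / (1.0 + beta)), 1.0)   # H(t) = 1 for t >= 1
    return (1.0 - t ** (1.0 + beta)) ** n
```

Already for moderate $n$ the exact survival function is close to its exponential limit, which is the content of the theorem in this special case.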



We can now put together all of the above results as follows. 

Theorem 3.9. Let $p_1, \dots, p_n$ be independently distributed with common
density function $f$, such that $f(1 - x) = c x^\beta + O(x^{\beta+\delta})$, $\delta > 0$.
Let $T$ be the expected time of the convergence of the batch learning
algorithm with overlaps $p_1, \dots, p_n$. If $\beta > 0$, then there exist
$C_1, C_2$ such that $C_1 n^{1/(1+\beta)} \le T \le C_2 n$, with probability tending
to 1 as $n$ tends to $\infty$. If $\beta = 0$, then there exist $C_1, C_2$ such that
$C_1 n \le T \le C_2 n \log n$, with probability tending to 1 as $n$ tends to $\infty$.
If $\beta < 0$, then $C^{-1} n^{1/(1+\beta)} \le T \le C n^{1/(1+\beta)}$, with probability tending
to 1 as $C$ goes to infinity.

The reader will remark that in the case $\beta < 0$, the upper and
lower bounds have the same order of magnitude as functions of $n$.
4. Independent concepts

We now invoke the independence hypothesis, whereby an application
of the inclusion-exclusion principle gives us:

Lemma 4.1. The probability $l_k$ that we have learned the concept $R_0$
after $k$ steps is given by
\[ l_k = \prod_{i=1}^{n} (1 - p_i^k). \]

Note that the probability $s_k$ of winning the game on the $k$-th step
is given by $s_k = l_k - l_{k-1} = (1 - l_{k-1}) - (1 - l_k)$. Since the expected
number of steps $T$ to learn the concept is given by
\[ T = \sum_{k=1}^{\infty} k s_k, \]
we immediately have (up to an additive constant 1, which does not affect
the asymptotics):

Lemma 4.2. The expected time $T$ of learning the concept $R_0$ is given
by
\[ T = \sum_{k=1}^{\infty} \left( 1 - \prod_{i=1}^{n} (1 - p_i^k) \right). \tag{9} \]

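Since the sum above is absolutely convergent, we can expand the
products and interchange the order of summation to get the following
formula for $T$:

Notation. Below, we identify subsets of $\{1, \dots, n\}$ with multi-indexes
(in the obvious way), and if $s = \{i_1, \dots, i_l\}$, then
\[ p_s \stackrel{\mathrm{def}}{=} p_{i_1} \cdots p_{i_l}. \]

Lemma 4.3. The expression in Eq. (9) can be rewritten as:
\[ T = \sum_{\emptyset \ne s \subseteq \{1,\dots,n\}} (-1)^{|s|-1} \left( \frac{1}{1 - p_s} - 1 \right). \tag{10} \]

Proof. With notation as above,
\[ \prod_{i=1}^{n} (1 - p_i^k) = \sum_{s \subseteq \{1,\dots,n\}} (-1)^{|s|} p_s^k, \]
so
\[ T = \sum_{k=1}^{\infty} \left( 1 - \sum_{s \subseteq \{1,\dots,n\}} (-1)^{|s|} p_s^k \right)
     = \sum_{k=1}^{\infty} \sum_{\emptyset \ne s \subseteq \{1,\dots,n\}} (-1)^{|s|-1} p_s^k
     = \sum_{\emptyset \ne s \subseteq \{1,\dots,n\}} (-1)^{|s|-1} \sum_{k=1}^{\infty} p_s^k
     = \sum_{\emptyset \ne s \subseteq \{1,\dots,n\}} (-1)^{|s|-1} \left( \frac{1}{1 - p_s} - 1 \right), \]
where the change in the order of summation is permissible since all
sums converge absolutely. □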
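The equality of the series over $k$ and the alternating sum over subsets is easy to confirm numerically for a small number of concepts. The sketch below uses arbitrary illustrative overlaps; the function names and the truncation length are assumptions of this example.

```python
import math
from itertools import combinations

def series_form(p, terms=5000):
    """sum_{k=1}^{terms} (1 - prod_i (1 - p_i**k)): the series over k, truncated."""
    return sum(1.0 - math.prod(1.0 - p_i ** k for p_i in p)
               for k in range(1, terms + 1))

def subset_form(p):
    """Sum over non-empty subsets s of (-1)**(|s|-1) * (1/(1 - p_s) - 1)."""
    total = 0.0
    for size in range(1, len(p) + 1):
        for s in combinations(p, size):
            p_s = math.prod(s)           # product of the overlaps in the subset
            total += (-1) ** (size - 1) * (1.0 / (1.0 - p_s) - 1.0)
    return total
```

The subset form has $2^n - 1$ terms and is only practical for small $n$; its value lies in the fact that each term has a simple expectation under the independence assumption.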



Formula (10) is useful in and of itself, but we now use it to analyse the
statistical properties of the time of success $T$ under our distribution
and independence assumptions. For this we shall need to study the
moment zeta function of a probability distribution, introduced below.
Its detailed properties are investigated in my paper [Rivin2002], where
Theorems 4.10 and 4.11 below are proved. Below we summarize
the definitions and the results.



4.1. Moment zeta function.

Definition 4.4. Let $\mathcal{F}$ be a probability distribution on a (possibly
infinite) interval $I$, and let $m_k(\mathcal{F}) = \int_I x^k \, \mathcal{F}(dx)$ be the $k$-th moment of
$\mathcal{F}$. Then the moment zeta function of $\mathcal{F}$ is defined to be
\[ \zeta_{\mathcal{F}}(s) = \sum_{k=1}^{\infty} m_k^s(\mathcal{F}), \]
whenever the sum is defined.

The definition is, in a way, motivated by the following:

Lemma 4.5. Let $\mathcal{F}$ be a probability distribution as above, and let
$x_1, \dots, x_n$ be independent random variables with common distribution
$\mathcal{F}$. Then
\[ E\left( \frac{1}{1 - x_1 \cdots x_n} - 1 \right) = \zeta_{\mathcal{F}}(n). \tag{11} \]
In particular, the expectation is undefined whenever the zeta function
is undefined.

Proof. Expand the fraction in a geometric series and apply Fubini's
theorem. □

Example 4.6. For $\mathcal{F}$ the uniform distribution on $[0, 1]$ we have
$m_k = 1/(k+1)$, so $\zeta_{\mathcal{F}}(s) = \zeta(s) - 1$, where $\zeta$ is the familiar
Riemann zeta function.
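For the uniform distribution the moments are $m_k = 1/(k+1)$, so with the sum in the definition starting at $k = 1$ the moment zeta function is $\sum_{k \ge 1} (k+1)^{-s} = \zeta(s) - 1$. The sketch below checks this at $s = 2$ against $\pi^2/6 - 1$; the function name and the truncation length are choices made for this illustration.

```python
import math

def moment_zeta_uniform(s, terms=10**5):
    """Partial sum of sum_{k>=1} m_k**s with m_k = 1/(k+1), the uniform moments."""
    return sum((1.0 / (k + 1)) ** s for k in range(1, terms + 1))

approx = moment_zeta_uniform(2)
exact = math.pi ** 2 / 6 - 1.0           # zeta(2) - 1
```

The partial sum differs from the exact value by roughly the reciprocal of the number of terms; for $s = 1$ the series diverges, in line with the condition in Corollary 4.8 for $\beta = 0$.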

Using standard techniques of asymptotic analysis, the following can
be shown (see [Rivin2002]):

Theorem 4.7. Let $\mathcal{F}$ be a continuous distribution supported in $[0, 1]$,
let $f$ be the density of the distribution $\mathcal{F}$, and suppose that
$f(1 - x) = c x^\beta + O(x^{\beta+\delta})$, for some $\delta > 0$. Then the $k$-th moment of
$\mathcal{F}$ is asymptotic to $C k^{-(1+\beta)}$, for $C = c\,\Gamma(\beta + 1)$.

Corollary 4.8. Under the assumptions of Theorem 4.7, $\zeta_{\mathcal{F}}(s)$ is
defined for $s > 1/(1 + \beta)$.

The moment zeta function can be used to analyze two of the three situations
occurring in the study of the batch learner algorithm. In the sequel, we
set $\alpha = \beta + 1$.

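4.2. $\alpha > 1$. In this case, we use our assumptions to rewrite Eq. (10)
as
\[ E(T) = -\sum_{k=1}^{n} \binom{n}{k} (-1)^k \zeta_{\mathcal{F}}(k). \tag{12} \]
This, in turn, can be rewritten (by expanding the definition of zeta) as
\[ E(T) = -\sum_{j=1}^{\infty} \left[ (1 - m_j(\mathcal{F}))^n - 1 \right] = \sum_{j=1}^{\infty} \left[ 1 - (1 - m_j(\mathcal{F}))^n \right]. \tag{13} \]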
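The passage from the alternating binomial sum of zeta values to the sum over moments is the binomial theorem applied term by term, and it can be checked numerically. The sketch below assumes (for illustration) the density $f(x) = (1 + \beta)(1 - x)^\beta$, whose moments are $m_j = \Gamma(j+1)\Gamma(\beta+2)/\Gamma(j+\beta+2)$, and truncates both sides at the same number $J$ of moments, so that they agree up to rounding. All function names are hypothetical.

```python
import math

def moment(j, beta):
    """j-th moment of the density f(x) = (1 + beta) * (1 - x)**beta on [0, 1]."""
    return math.exp(math.lgamma(j + 1) + math.lgamma(beta + 2) - math.lgamma(j + beta + 2))

def expected_T_binomial(n, beta, J):
    """-sum_{k=1}^{n} C(n, k) (-1)**k zeta(k), with zeta truncated to J moments."""
    zeta = lambda k: sum(moment(j, beta) ** k for j in range(1, J + 1))
    return -sum(math.comb(n, k) * (-1) ** k * zeta(k) for k in range(1, n + 1))

def expected_T_moments(n, beta, J):
    """sum_{j=1}^{J} (1 - (1 - m_j)**n): the same quantity written via moments."""
    return sum(1.0 - (1.0 - moment(j, beta)) ** n for j in range(1, J + 1))
```

The second form is the one to use in practice: it has no cancellation between large alternating terms and remains stable for large $n$.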
Using the moment zeta function we can show: 

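Theorem 4.9. Let $\mathcal{F}$ be a continuous distribution supported on $[0, 1]$,
and let $f$ be the density of $\mathcal{F}$. Suppose further that
\[ \lim_{x\to 1} \frac{f(x)}{(1 - x)^\beta} = c, \]
for $\beta, c > 0$. Then,
\[ \lim_{n\to\infty} n^{-1/(1+\beta)} \sum_{k=1}^{n} \binom{n}{k} (-1)^k \zeta_{\mathcal{F}}(k)
   = -\int_0^\infty \frac{1 - \exp\left( -c\,\Gamma(\beta + 1)\, u^{1+\beta} \right)}{u^2} \, du. \]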
4.3. $\alpha = 1$. In this case,
\[ f(x) = L + o(1) \tag{14} \]
as $x$ approaches 1, and so Theorem 4.7 tells us that
\[ \lim_{j\to\infty} j\, m_j(\mathcal{F}) = L. \tag{15} \]
It is not hard to see that $\zeta_{\mathcal{F}}(n)$ is defined for $n \ge 2$. We break up the
expression in Eq. (10) as
\[ T = \sum_{j=1}^{n} \left( \frac{1}{1 - p_j} - 1 \right)
     + \sum_{s \subseteq \{1,\dots,n\},\ |s| > 1} (-1)^{|s|-1} \left( \frac{1}{1 - p_s} - 1 \right). \tag{16} \]
Let
\[ T_1 = \sum_{j=1}^{n} \left( \frac{1}{1 - p_j} - 1 \right), \qquad
   T_2 = \sum_{s \subseteq \{1,\dots,n\},\ |s| > 1} (-1)^{|s|-1} \left( \frac{1}{1 - p_s} - 1 \right). \]
The first sum $T_1$ has no expectation; however, $T_1/n$ does have a
stable distribution centered on $c \log n + c_2$. We will keep this in mind,
but now let us look at the second sum $T_2$. Its expectation can be rewritten as
\[ E(T_2) = -\sum_{j=1}^{\infty} \left[ (1 - m_j(\mathcal{F}))^n - 1 + n\, m_j(\mathcal{F}) \right]. \tag{17} \]

We can again use the moment zeta function to analyse the properties
of $T_2$, to get:

Theorem 4.10. Let $\mathcal{F}$ be a continuous distribution supported on $[0, 1]$,
and let $f$ be the density of $\mathcal{F}$. Suppose further that
\[ \lim_{x\to 1} f(x) = c > 0. \]
Then,
\[ \sum_{k=2}^{n} \binom{n}{k} (-1)^k \zeta_{\mathcal{F}}(k) \sim c\, n \log n. \]

To get error estimates, we need a stronger assumption on the function
$f$ than the weakest possible assumption made in Theorem 4.10.

Theorem 4.11. Let $\mathcal{F}$ be a continuous distribution supported on $[0, 1]$,
and let $f$ be the density of $\mathcal{F}$. Suppose further that
\[ f(x) = c + O((1 - x)^\delta), \]
where $\delta > 0$. Then,
\[ \sum_{k=2}^{n} \binom{n}{k} (-1)^k \zeta_{\mathcal{F}}(k) \sim c\, n \log n + O(n). \]

The conclusion differs somewhat from that of Section 4.2 in that we
get an additional term of $c\, n \log n$, where $c = \lim_{x\to 1} f(x) = \lim_{j\to\infty} j\, m_j$.
This term is equal (with opposing sign) to the center of the stable law
satisfied by $T_1$, so in the case $\alpha = 1$ we see that $T$ has no expectation but
satisfies a law of large numbers, of the following form:

Theorem 4.12 (Law of large numbers). There exists a constant $C$ such
that $\lim_{n\to\infty} P(|T/n - C| > y) = 0$ for every $y > 0$.



5. $\alpha < 1$

In this case the analysis goes through as in the preceding section
when $\alpha > 1/2$, but then runs into considerable difficulties. However,
in this case we note that Theorem 3.9 actually gives us tight bounds.



6. The inevitable comparison 

We are now in a position to compare the performance of the batch
learning algorithm with that of the memoryless learning algorithm and
of learning with full memory, as summarized in Theorem 2.1. We
combine our computations above with the observation that the batch
learner algorithm converges geometrically (Lemma 4.1), to get:



Theorem 6.1. Let $N_\Delta$ be the number of steps it takes for the student
to have probability $1 - \Delta$ of learning the concept using the batch
learner algorithm. Then we have the following estimates for $N_\Delta$:

• if the distribution of overlaps is uniform, or more generally, the
density function satisfies $f(1 - x) = c + O(x^\delta)$ near $x = 0$, with
$\delta, c > 0$, then there exist positive constants $C_1, C_2$ such that
\[ \lim_{n\to\infty} P\left( C_1 < \frac{N_\Delta}{|\log \Delta|\, n} < C_2 \right) = 1; \]

• if the probability density function $f(1 - x)$ is asymptotic to
$c x^\beta + O(x^{\beta+\delta})$, $\delta, \beta > 0$, as $x$ approaches 0, then
\[ \lim_{n\to\infty} P\left( c_1 < \frac{N_\Delta}{|\log \Delta|\, n^{1/(1+\beta)}} < c_2 \right) = 1 \]
for some positive constants $c_1, c_2$;

• if the asymptotic behavior is as above, but $-1 < \beta < 0$, then
\[ \lim_{x\to\infty} P\left( \frac{1}{x} < \frac{N_\Delta}{|\log \Delta|\, n^{1/(1+\beta)}} < x \right) = 1. \]



Comparing Theorems 2.1 and 6.1, we see that the batch learning
algorithm is uniformly superior for $\beta \ge 0$, and the only one of the three
to achieve sublinear performance whenever $\beta > 0$ (the other two never
do better than linearly, unless the distribution $\mathcal{F}$ is supported away
from 1). On the other hand, for $\beta < 0$, the batch learning algorithm
performs comparably to the memoryless learner algorithm, and worse
than learning with full memory.